Feature Selection in Proteomic Pattern Data with Support Vector Machines
Abstract
This paper introduces novel methods for feature selection (FS) based on support vector machines (SVM). The methods combine feature subsets produced by a variant of SVM-RFE, a popular feature ranking/selection algorithm based on SVM. Two combination strategies are proposed: a union of frequently occurring features, and an ensemble of classifiers built on single feature subsets. The resulting methods are applied to proteomic pattern data for tumor diagnostics. Results of experiments on three proteomic pattern datasets indicate that combining feature subsets positively affects the prediction accuracy of both SVM and SVM-RFE. A discussion of the biological interpretation of the selected features is provided.

I. INTRODUCTION

FS can be formalized as a combinatorial optimization problem: finding the feature set that maximizes the quality of the hypothesis learned from those features. FS is viewed as a major bottleneck of supervised learning and data mining [1], [2]. For the sake of learning performance, it is highly desirable to discard irrelevant features prior to learning, especially when the number of available features significantly exceeds the number of examples, as is the case in bioinformatics. In particular, biological experiments using laboratory technologies such as microarray and proteomic techniques generate data with a very large number of attributes, in general much larger than the number of examples. FS therefore provides a fundamental step in the analysis of such data [3]. By selecting only a subset of attributes, the prediction accuracy can possibly improve and more insight into the nature of the prediction problem can be gained. A number of effective FS methods for classification rank features and discard those whose rank is smaller than a given threshold [1], [4]. This threshold can either be provided by the user, as in [5], or determined automatically, as in [6], by means of the estimated rank of a new random feature.
A popular algorithm based on the above approach is SVM-RFE [5]. It is an iterative algorithm; each iteration consists of the following two steps. First, feature weights, obtained by training a linear SVM on the training set, are used in a scoring function for ranking features. Next, the feature with minimum rank is removed from the data. In this way, a chain of feature subsets of decreasing size is obtained. SVM classifiers are trained on training sets restricted to these feature subsets, and the classifier with the best predictive performance is selected. In the original SVM-RFE algorithm, one feature is discarded at each iteration. Other choices are suggested in [5], where at each iteration all features with rank lower than a user-given threshold are removed. The choice of the threshold affects the results of SVM-RFE, and heuristics for choosing a threshold value have been proposed [5], [6]. In this paper the problem of choosing a threshold is sidestepped by considering multiple runs of SVM-RFE with different thresholds. Each run produces one feature subset, and the resulting feature subsets are combined in order to obtain a robust classifier. Two methods for building a classifier from a combination of feature subsets are proposed, called JOIN and ENSEMBLE. JOIN generates a classifier by training an SVM on the data restricted to those features that occur more than a given number of times in the list of feature subsets. ENSEMBLE generates a majority-vote ensemble of classifiers, where each classifier is obtained by training an SVM on the data restricted to one feature subset. This combination strategy is used, e.g., in [7], where decision trees trained on data restricted to randomly selected feature subsets are ensembled. JOIN and ENSEMBLE are compared experimentally with an SVM trained on all features, and with a multistart version of SVM-RFE.
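The SVM-RFE loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tiny subgradient-descent linear SVM, all parameter names, and the toy usage are assumptions made here for self-containment; in practice a proper SVM solver would replace `train_linear_svm`.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Stand-in linear SVM: subgradient descent on the regularized
    hinge loss. Labels y must be in {-1, +1}. Returns the weight
    vector w (bias omitted for brevity)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        mask = margins < 1                      # margin violators
        grad = lam * w - (y[mask] @ X[mask]) / n
        w -= lr * grad
    return w

def svm_rfe(X, y, n_remove=1):
    """One SVM-RFE chain: repeatedly train a linear SVM and drop the
    n_remove features with the smallest squared weight (the ranking
    score). Returns the chain of surviving index sets, largest first."""
    remaining = list(range(X.shape[1]))
    chain = [list(remaining)]
    while len(remaining) > n_remove:
        w = train_linear_svm(X[:, remaining], y)
        scores = w ** 2                         # feature ranking criterion
        drop = set(np.argsort(scores)[:n_remove])
        remaining = [f for i, f in enumerate(remaining) if i not in drop]
        chain.append(list(remaining))
    return chain

# Toy usage: feature 0 separates the classes, feature 1 is noise,
# so RFE should eliminate feature 1 first.
X = np.array([[1.0, 0.1], [2.0, -0.2], [-1.0, 0.3], [-2.0, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(svm_rfe(X, y))  # chain of feature subsets of decreasing size
```

A classifier would then be trained on each subset in the chain and the best one kept, as the text describes.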
Multistart SVM-RFE performs multiple runs of SVM-RFE with different thresholds, and selects among the resulting feature subsets the one minimizing the error (on a hold-out set) of an SVM trained on the data restricted to that feature subset. The four methods are applied to proteomic pattern data from cancer and healthy patients. This type of data is used for cancer detection and potential biomarker identification. Motivations for choosing FS methods based on linear SVM are their robustness with respect to high-dimensional input data, and the experimental observation that such data appear to be almost linearly separable (see, e.g., [8], [9]). Experiments are conducted on three proteomic pattern datasets from prostate and ovarian cancer. On two of the three datasets, JOIN and ENSEMBLE achieve significantly better predictive accuracy than SVM and multistart SVM-RFE. On the third dataset, JOIN obtains perfect classification and the other methods near-perfect classification. The results indicate that FS methods combining feature subsets from multiple runs provide a robust and effective approach for feature selection in proteomic pattern data. The paper is organized as follows. Section II gives an overview of the considered FS methodology. Section III describes the data used in the experiments. Section IV reports on the results of the experiments. The paper ends with a discussion and pointers to future research.
Similar Papers
Anomaly Detection Using SVM as Classifier and Decision Tree for Optimizing Feature Vectors
Abstract- With the advancement and development of computer network technologies, the way for intruders has become smoother; therefore, to detect threats and attacks, the importance of intrusion detection systems (IDS) as one of the key elements of security is increasing. One of the challenges of intrusion detection systems is managing the large number of network traffic features. Removing un...
Feature Selection Using Multi Objective Genetic Algorithm with Support Vector Machine
Different approaches have been proposed for feature selection to obtain a suitable feature subset among all features. These methods search the feature space for feature subsets which satisfy some criteria or optimize several objective functions. The objective functions are divided into two main groups: filter and wrapper methods. In filter methods, feature subsets are selected due to some measu...
A Feature Selection Method Based on a Support Vector Machine and the Cumulative Distribution Function
Feature selection is an important issue in the research areas of machine learning and data mining. It reduces the dimensionality of the data and enhances the performance and interpretability of data analysis algorithms, such as clustering or classification. This paper proposes a feature selection method based on support vector machines and distance-based cumulative distribution functions. This meth...
Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine
DNA microarray gene expression data give us access to a wealth of information with thousands of variables (genes). Analysis of this information can reveal genetic causes of disease and tumor differences. In this study we try to reduce the high-dimensional data by statistical methods to select valuable genes with high impact as biomarkers, and then classify ovarian tumors based on gene expression data of...
Robust SVM-Based Biomarker Selection with Noisy Mass Spectrometric Proteomic Data
Computational analysis of mass spectrometric (MS) proteomic data from sera is of potential relevance for diagnosis, prognosis, choice of therapy, and study of disease activity. To this aim, feature selection techniques based on machine learning can be applied for detecting potential biomarkers and biomarker patterns. A key issue concerns the interpretability and robustness of the output results g...
Publication date: 2017